I, Apurv Sathwara, hereby state that I have not communicated with or gained information in any way from any person or resource that would violate the College’s academic integrity policies, and that all work presented is my own. In addition, I also agree not to share my work in any way, before or after submission, that would violate the College’s academic integrity policies.
R version 4.1.1
RStudio version 1.4.1717
Documentation of the data set:
■ attribution of the owner/creator of the data: https://www.kaggle.com/aljarah
■ links to the data: https://www.kaggle.com/aljarah/xAPI-Edu-Data?select=xAPI-Edu-Data.csv
Introduction
Load Dataset
## [1] 480 17
As the output shows, the dataset has 480 rows and 17 columns. Because class is our response variable, we treat the remaining 16 columns as features and try to visualise and understand their effect on class.
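The loading step can be sketched as follows. This is a minimal sketch that assumes the CSV from the Kaggle link has been saved locally as `xAPI-Edu-Data.csv`; the path and the helper name `load_edu` are assumptions, since the original loading code is not shown.

```r
# Hypothetical helper: read the Kaggle CSV into a data frame.
load_edu <- function(path) {
  read.csv(path, stringsAsFactors = FALSE)
}

csv_path <- "xAPI-Edu-Data.csv"  # assumed local file name
if (file.exists(csv_path)) {
  edu <- load_edu(csv_path)
  print(dim(edu))       # number of rows, then number of columns
  print(ncol(edu) - 1)  # features left once the response column is set aside
}
```

For the full file, `dim()` should report the same 480 rows and 17 columns as the output above.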
The purpose of this research is to evaluate the factors that may affect students’ academic performance.
This is an educational data set collected from a learning management system (LMS) called Kalboard 360.
The dataset consists of 305 males and 175 females. The students come from different countries: 179 from Kuwait, 172 from Jordan, 28 from Palestine, 22 from Iraq, 17 from Lebanon, 12 from Tunis, 11 from Saudi Arabia, 9 from Egypt, 7 from Syria, 6 each from the USA, Iran and Libya, 4 from Morocco and one from Venezuela.
The data were collected over two educational semesters: 245 student records during the first semester and 235 during the second.
1 Gender - student’s gender (nominal: 'Male' or 'Female')
2 Nationality - student’s nationality (nominal: 'Kuwait', 'Lebanon', 'Egypt', 'SaudiArabia', 'USA', 'Jordan', 'Venezuela', 'Iran', 'Tunis', 'Morocco', 'Syria', 'Palestine', 'Iraq', 'Lybia')
3 Place of birth - student’s place of birth (nominal: 'Kuwait', 'Lebanon', 'Egypt', 'SaudiArabia', 'USA', 'Jordan', 'Venezuela', 'Iran', 'Tunis', 'Morocco', 'Syria', 'Palestine', 'Iraq', 'Lybia')
4 Educational Stages - educational level the student belongs to (nominal: 'lowerlevel', 'MiddleSchool', 'HighSchool')
5 Grade Levels - grade the student belongs to (nominal: 'G-01', 'G-02', 'G-03', 'G-04', 'G-05', 'G-06', 'G-07', 'G-08', 'G-09', 'G-10', 'G-11', 'G-12')
6 Section ID - classroom the student belongs to (nominal: 'A', 'B', 'C')
7 Topic - course topic (nominal: 'English', 'Spanish', 'French', 'Arabic', 'IT', 'Math', 'Chemistry', 'Biology', 'Science', 'History', 'Quran', 'Geology')
8 Semester - school year semester (nominal: 'First', 'Second')
9 Relation - parent responsible for the student (nominal: 'mother', 'father')
10 Raised hand - how many times the student raises his/her hand in the classroom (numeric: 0-100)
11 Visited resources - how many times the student visits course content (numeric: 0-100)
12 Viewing announcements - how many times the student checks new announcements (numeric: 0-100)
13 Discussion groups - how many times the student participates in discussion groups (numeric: 0-100)
14 Parent Answering Survey - whether the parent answered the surveys provided by the school (nominal: 'Yes', 'No')
15 Parent School Satisfaction - the degree of parent satisfaction with the school (nominal: 'Good', 'Bad')
16 Student Absence Days - the number of absence days for each student (nominal: 'Above-7', 'Under-7')
17 Class - the students are classified into three intervals based on their total grade/mark: L (Low-Level) covers values from 0 to 69, M (Middle-Level) covers values from 70 to 89, and H (High-Level) covers values from 90 to 100.
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.4 v dplyr 1.0.7
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 2.0.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following object is masked from 'package:purrr':
##
## compact
## corrplot 0.90 loaded
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
## Warning: package 'ggthemes' was built under R version 4.1.2
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
## Warning: package 'randomForest' was built under R version 4.1.2
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:gridExtra':
##
## combine
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
## Warning: package 'party' was built under R version 4.1.2
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
##
## Attaching package: 'modeltools'
## The following object is masked from 'package:plyr':
##
## empty
## Loading required package: strucchange
## Warning: package 'strucchange' was built under R version 4.1.2
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.1.2
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
## Warning: package 'sandwich' was built under R version 4.1.2
##
## Attaching package: 'strucchange'
## The following object is masked from 'package:stringr':
##
## boundary
## Warning: package 'plotly' was built under R version 4.1.2
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following objects are masked from 'package:plyr':
##
## arrange, mutate, rename, summarise
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
## Warning: package 'rpart' was built under R version 4.1.2
## Warning: package 'rpart.plot' was built under R version 4.1.2
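The startup messages above warn that plyr was attached after dplyr and that `select()` is masked by both MASS and plotly. One way to avoid both problems is sketched below; it is a suggestion based on those messages, not the loading code actually used in this report.

```r
# Attach plyr before dplyr, as the startup message recommends, so that
# dplyr's verbs are the ones found first on the search path.
pkgs <- c("plyr", "dplyr")
for (p in pkgs) {
  if (requireNamespace(p, quietly = TRUE)) {
    suppressPackageStartupMessages(library(p, character.only = TRUE))
  }
}

# select() is masked by MASS and plotly; an explicit namespace prefix
# makes the call independent of the order in which packages were attached.
if (requireNamespace("dplyr", quietly = TRUE)) {
  print(dplyr::select(data.frame(a = 1:2, b = 3:4), a))
}
```

The same `pkg::function()` form works for any other masked function listed in the messages, such as `dplyr::filter()`.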
The Project
Data Inspecting
## 'data.frame': 480 obs. of 17 variables:
## $ gender : chr "M" "M" "M" "M" ...
## $ nation : chr "KW" "KW" "KW" "KW" ...
## $ birthplace : chr "KuwaIT" "KuwaIT" "KuwaIT" "KuwaIT" ...
## $ stageid : chr "lowerlevel" "lowerlevel" "lowerlevel" "lowerlevel" ...
## $ gradeid : chr "G-04" "G-04" "G-04" "G-04" ...
## $ sectionid : chr "A" "A" "A" "A" ...
## $ topic : chr "IT" "IT" "IT" "IT" ...
## $ semester : chr "F" "F" "F" "F" ...
## $ relation : chr "Father" "Father" "Father" "Father" ...
## $ raisedhands: int 15 20 10 30 40 42 35 50 12 70 ...
## $ n_visit : int 16 20 7 25 50 30 12 10 21 80 ...
## $ n_view : int 2 3 0 5 12 13 0 15 16 25 ...
## $ discussion : int 20 25 30 35 50 70 17 22 50 70 ...
## $ p_answer : chr "Yes" "Yes" "No" "No" ...
## $ p_satis : chr "Good" "Good" "Bad" "Bad" ...
## $ n_absent : chr "Under-7" "Under-7" "Above-7" "Above-7" ...
## $ class : chr "M" "M" "L" "L" ...
First of all, let’s see a glimpse of our Dataset
## Rows: 480
## Columns: 17
## $ gender <chr> "M", "M", "M", "M", "M", "F", "M", "M", "F", "F", "M", "M"~
## $ nation <chr> "KW", "KW", "KW", "KW", "KW", "KW", "KW", "KW", "KW", "KW"~
## $ birthplace <chr> "KuwaIT", "KuwaIT", "KuwaIT", "KuwaIT", "KuwaIT", "KuwaIT"~
## $ stageid <chr> "lowerlevel", "lowerlevel", "lowerlevel", "lowerlevel", "l~
## $ gradeid <chr> "G-04", "G-04", "G-04", "G-04", "G-04", "G-04", "G-07", "G~
## $ sectionid <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "B", "A", "B"~
## $ topic <chr> "IT", "IT", "IT", "IT", "IT", "IT", "Math", "Math", "Math"~
## $ semester <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F"~
## $ relation <chr> "Father", "Father", "Father", "Father", "Father", "Father"~
## $ raisedhands <int> 15, 20, 10, 30, 40, 42, 35, 50, 12, 70, 50, 19, 5, 20, 62,~
## $ n_visit <int> 16, 20, 7, 25, 50, 30, 12, 10, 21, 80, 88, 6, 1, 14, 70, 4~
## $ n_view <int> 2, 3, 0, 5, 12, 13, 0, 15, 16, 25, 30, 19, 0, 12, 44, 22, ~
## $ discussion <int> 20, 25, 30, 35, 50, 70, 17, 22, 50, 70, 80, 12, 11, 19, 60~
## $ p_answer <chr> "Yes", "Yes", "No", "No", "No", "Yes", "No", "Yes", "Yes",~
## $ p_satis <chr> "Good", "Good", "Bad", "Bad", "Bad", "Bad", "Bad", "Good",~
## $ n_absent <chr> "Under-7", "Under-7", "Above-7", "Above-7", "Above-7", "Ab~
## $ class <chr> "M", "M", "L", "L", "M", "M", "L", "M", "M", "M", "H", "M"~
Data Preprocessing
Missing Values
## gender nation birthplace stageid gradeid sectionid
## 0 0 0 0 0 0
## topic semester relation raisedhands n_visit n_view
## 0 0 0 0 0 0
## discussion p_answer p_satis n_absent class
## 0 0 0 0 0
There are no missing values in the dataset, so nothing needs to be removed or imputed.
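A per-column count like the one above can be produced in base R as sketched here. The helper name `count_missing` is an assumption, and the small data frame only repeats the first three records from the `str()` output so that the example runs on its own.

```r
# Count missing values per column; all zeros means nothing to drop or impute.
count_missing <- function(df) {
  colSums(is.na(df))
}

# First three records from the str() output above, three columns only.
edu_head <- data.frame(
  gender      = c("M", "M", "M"),
  raisedhands = c(15L, 20L, 10L),
  class       = c("M", "M", "L")
)

print(count_missing(edu_head))
print(anyNA(edu_head))  # single TRUE/FALSE answer for the whole data frame
```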
The variables have been renamed for easier interpretation, and class is treated as a factor with the levels L, M and H.
## gender nation birthplace stageid
## Length:480 Length:480 Length:480 Length:480
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## gradeid sectionid topic semester
## Length:480 Length:480 Length:480 Length:480
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## relation raisedhands n_visit n_view
## Length:480 Min. : 0.00 Min. : 0.0 Min. : 0.00
## Class :character 1st Qu.: 15.75 1st Qu.:20.0 1st Qu.:14.00
## Mode :character Median : 50.00 Median :65.0 Median :33.00
## Mean : 46.77 Mean :54.8 Mean :37.92
## 3rd Qu.: 75.00 3rd Qu.:84.0 3rd Qu.:58.00
## Max. :100.00 Max. :99.0 Max. :98.00
## discussion p_answer p_satis n_absent
## Min. : 1.00 Length:480 Length:480 Length:480
## 1st Qu.:20.00 Class :character Class :character Class :character
## Median :39.00 Mode :character Mode :character Mode :character
## Mean :43.28
## 3rd Qu.:70.00
## Max. :99.00
## class
## L:127
## M:211
## H:142
##
##
##
## gender nation birthplace stageid gradeid sectionid topic semester relation
## 1 M KW KuwaIT lowerlevel G-04 A IT F Father
## 2 M KW KuwaIT lowerlevel G-04 A IT F Father
## 3 M KW KuwaIT lowerlevel G-04 A IT F Father
## 4 M KW KuwaIT lowerlevel G-04 A IT F Father
## 5 M KW KuwaIT lowerlevel G-04 A IT F Father
## 6 F KW KuwaIT lowerlevel G-04 A IT F Father
## raisedhands n_visit n_view discussion p_answer p_satis n_absent class
## 1 15 16 2 20 Yes Good Under-7 M
## 2 20 20 3 25 Yes Good Under-7 M
## 3 10 7 0 30 No Bad Above-7 L
## 4 30 25 5 35 No Bad Above-7 L
## 5 40 50 12 50 No Bad Above-7 M
## 6 42 30 13 70 Yes Bad Above-7 M
## gender nation birthplace stageid gradeid sectionid topic semester
## 475 F Jordan Jordan MiddleSchool G-08 A Chemistry F
## 476 F Jordan Jordan MiddleSchool G-08 A Chemistry S
## 477 F Jordan Jordan MiddleSchool G-08 A Geology F
## 478 F Jordan Jordan MiddleSchool G-08 A Geology S
## 479 F Jordan Jordan MiddleSchool G-08 A History F
## 480 F Jordan Jordan MiddleSchool G-08 A History S
## relation raisedhands n_visit n_view discussion p_answer p_satis n_absent
## 475 Father 2 7 4 8 No Bad Above-7
## 476 Father 5 4 5 8 No Bad Above-7
## 477 Father 50 77 14 28 No Bad Under-7
## 478 Father 55 74 25 29 No Bad Under-7
## 479 Father 30 17 14 57 No Bad Above-7
## 480 Father 35 14 23 62 No Bad Above-7
## class
## 475 L
## 476 L
## 477 M
## 478 M
## 479 L
## 480 L
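The renaming and the factor conversion behind the summary above could look like the following sketch. The original code is not shown, so the helper name `rename_edu` and the rename-by-position approach are assumptions; the two sample records are rows 1 and 3 of the `head()` output.

```r
# Short column names used throughout this report, in file order.
new_names <- c("gender", "nation", "birthplace", "stageid", "gradeid",
               "sectionid", "topic", "semester", "relation", "raisedhands",
               "n_visit", "n_view", "discussion", "p_answer", "p_satis",
               "n_absent", "class")

# Hypothetical helper: rename by position and order the response L < M < H.
rename_edu <- function(df) {
  stopifnot(ncol(df) == length(new_names))
  names(df) <- new_names
  df$class <- factor(df$class, levels = c("L", "M", "H"))
  df
}

# Two records copied from the head() output; explicit column classes keep
# the semester code "F" from being read as the logical value FALSE.
raw <- read.csv(
  header = FALSE,
  colClasses = c(rep("character", 9), rep("integer", 4), rep("character", 4)),
  text = c(
    "M,KW,KuwaIT,lowerlevel,G-04,A,IT,F,Father,15,16,2,20,Yes,Good,Under-7,M",
    "M,KW,KuwaIT,lowerlevel,G-04,A,IT,F,Father,10,7,0,30,No,Bad,Above-7,L"
  )
)

edu <- rename_edu(raw)
str(edu)
print(summary(edu$class))  # counts per level, listed as L, M, H
```

Setting the factor levels explicitly is what makes `summary()` list the classes in the order L, M, H instead of alphabetically.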
Exploratory data analysis
Gender
Let us check whether there is any relationship between gender and the performance of the students.
According to the data we have, the female students have outperformed the male students. This is depicted visually in the graph above.
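The comparison behind this statement can also be checked numerically with a table of class shares within each gender. The sketch below uses only the 12 records printed by `head()` and `tail()` earlier, so that it runs on its own; those rows are not representative of the full 480 records, and the helper name `class_share` is an assumption.

```r
# Share of each class within each group (rows sum to 1).
class_share <- function(df, by) {
  round(prop.table(table(df[[by]], df$class), margin = 1), 3)
}

# The 12 records shown by head() and tail(); a stand-in for the full data.
edu12 <- data.frame(
  gender = c("M", "M", "M", "M", "M", "F", "F", "F", "F", "F", "F", "F"),
  class  = factor(c("M", "M", "L", "L", "M", "M", "L", "L", "M", "M", "L", "L"),
                  levels = c("L", "M", "H"))
)

share <- class_share(edu12, "gender")
print(share)
```

Run on the full data frame, a larger share of H and a smaller share of L in the female row would support the conclusion drawn from the graph.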
The graph reveals that Jordan and Kuwait are over-represented in our sample when compared to other nationalities. Egypt, Iran, Lybia, Morocco, Syria, Tunis, USA and Venezuela have very few observations.
Chemistry has the least diversity among all the topics. Most of the enrolled students in chemistry are from Jordan. Also, most of the students who have pursued IT are from Kuwait. Topics like French, English and Arabic have the most diversity.
Students have been performing really well in Biology. We can also note that no student has scored less than 70 in Geology.
Most of the available data is for middle school students, and high school has very few observations. Next, we check whether school level has a considerable effect on our class variable.
School level does not appear to affect the performance of the students considerably. The majority of students score in the 70-89 range irrespective of their school level.
The majority of students score between 70 and 89 in both semesters. However, the proportion of students scoring more than 89 is higher in the second semester.
The variables n_visit and raisedhands are clearly positively correlated. Students who visit the course resources often are therefore more likely to raise their hands in class than those who do not.
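A correlation like this one can be computed with `cor()`, and `cor.test()` gives a p-value for it. The sketch below uses only the 12 records printed by `head()` and `tail()` earlier, so the numbers it prints are not the ones for the full data set.

```r
# raisedhands and n_visit for the 12 records shown by head() and tail().
raisedhands <- c(15, 20, 10, 30, 40, 42, 2, 5, 50, 55, 30, 35)
n_visit     <- c(16, 20, 7, 25, 50, 30, 7, 4, 77, 74, 17, 14)

r <- cor(raisedhands, n_visit)  # Pearson correlation coefficient
print(round(r, 2))

# Test of the null hypothesis that the true correlation is zero.
print(cor.test(raisedhands, n_visit)$p.value)
```

For all four numeric columns at once, `cor()` also accepts a data frame of numeric columns and returns the full correlation matrix.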
The performance of the students seems to depend on the number of times they raised their hands, which can be read as a measure of their involvement in class.
The female students raised their hands more often than the male students. This is consistent with the idea that the number of times a student raises a hand is a potential predictor of academic performance.
Geology appears to be the most engaging subject of all. IT, on the other hand, has very low student participation.
Data Splitting
In classification, the data set needs to be divided into training and test sets.
Model training is performed on the training set, and the test set is used to assess the performance of the classification model.
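A random split of this kind can be sketched in base R as follows. The 80/20 ratio is inferred from the confusion matrices in the next section, which add up to 384 training and 96 test records; the seed and the use of `sample()` are assumptions, since the original splitting code is not shown.

```r
set.seed(123)  # arbitrary seed, only so that the split is reproducible
n <- 480       # number of records in the data set

# Draw 80% of the row numbers for training; the rest form the test set.
train_idx <- sample(seq_len(n), size = round(0.8 * n))
test_idx  <- setdiff(seq_len(n), train_idx)

print(length(train_idx))
print(length(test_idx))

# With a data frame `edu`, the two sets would then be:
#   train <- edu[train_idx, ]
#   test  <- edu[test_idx, ]
```

A plain random split does not guarantee the same L/M/H proportions in both sets; a stratified split would be an alternative if that matters.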
Model Training
## Confusion Matrix and Statistics
##
## Predicted
## Truth L M H
## L 83 15 1
## M 9 142 19
## H 0 26 89
##
## Overall Statistics
##
## Accuracy : 0.8177
## 95% CI : (0.7754, 0.855)
## No Information Rate : 0.4766
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7162
##
## Mcnemar's Test P-Value : 0.3094
##
## Statistics by Class:
##
## Class: L Class: M Class: H
## Sensitivity 0.9022 0.7760 0.8165
## Specificity 0.9452 0.8607 0.9055
## Pos Pred Value 0.8384 0.8353 0.7739
## Neg Pred Value 0.9684 0.8084 0.9257
## Prevalence 0.2396 0.4766 0.2839
## Detection Rate 0.2161 0.3698 0.2318
## Detection Prevalence 0.2578 0.4427 0.2995
## Balanced Accuracy 0.9237 0.8183 0.8610
## Confusion Matrix and Statistics
##
## Predicted
## Truth L M H
## L 19 9 0
## M 2 31 8
## H 0 7 20
##
## Overall Statistics
##
## Accuracy : 0.7292
## 95% CI : (0.6289, 0.8148)
## No Information Rate : 0.4896
## P-Value [Acc > NIR] : 1.542e-06
##
## Kappa : 0.5802
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: L Class: M Class: H
## Sensitivity 0.9048 0.6596 0.7143
## Specificity 0.8800 0.7959 0.8971
## Pos Pred Value 0.6786 0.7561 0.7407
## Neg Pred Value 0.9706 0.7091 0.8841
## Prevalence 0.2188 0.4896 0.2917
## Detection Rate 0.1979 0.3229 0.2083
## Detection Prevalence 0.2917 0.4271 0.2812
## Balanced Accuracy 0.8924 0.7277 0.8057
The accuracy on the test set is about 73% (0.7292), which is lower than the accuracy on the training set of about 82% (0.8177).
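The accuracy and the per-class figures can be recomputed by hand from the test-set confusion matrix printed above, which is a useful check on how the table is laid out. The sketch below copies that matrix with the true classes in rows and the predictions in columns, as labelled in the output.

```r
# Test-set confusion matrix from the output above:
# rows = true class, columns = predicted class.
cm <- matrix(c(19,  9,  0,
                2, 31,  8,
                0,  7, 20),
             nrow = 3, byrow = TRUE,
             dimnames = list(Truth = c("L", "M", "H"),
                             Predicted = c("L", "M", "H")))

accuracy  <- sum(diag(cm)) / sum(cm)  # share of correct predictions
recall    <- diag(cm) / rowSums(cm)   # per true class: share that was found
precision <- diag(cm) / colSums(cm)   # per predicted class: share that was right

print(round(accuracy, 4))
print(round(recall, 4))
print(round(precision, 4))
```

With this layout, the row-wise ratios reproduce the values printed as "Pos Pred Value" and the column-wise ratios reproduce those printed as "Sensitivity". This suggests that the printed per-class statistics treat the Predicted columns as the reference, so the two labels should be read with care.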
Conclusion
Que. What have I learned and done?
Ans. The results show that the model is more accurate on the training data than on the test data. An algorithm of this kind can be used to make predictions about students’ learning progress and to spot students who are struggling. From its output one can also assess students’ capabilities and work with the students who did not perform well in class, according to the data given. The data were first pre-processed and explored with several exploratory data analyses. Second, a correlation analysis was used to investigate the relationships between certain characteristics and the class variable. Finally, the data were split and modelled with a decision tree algorithm, which produced the results for the given data.
Que. What will I do to improve, and what are my thoughts?
Ans. Some columns will be deleted. Both Grade ID and Stage ID describe the educational stage of students, and Grade ID is divided into 12 categories, which makes it unnecessarily difficult to analyse. Therefore, Grade ID will be deleted. PlaceofBirth will be deleted for a similar reason, because it is similar to Nationality. SectionID, which gives the classroom a student belongs to, and Semester will also be deleted. I will divide the remaining 13 factors into three categories: (1) demographic characteristics, (2) academic background characteristics and (3) behaviour characteristics. At the same time, I will also analyse the possible internal connections between the columns.